COMPUTER ALGEBRA IN SYSTEMS BIOLOGY 

REINHARD LAUBENBACHER AND BERND STURMFELS 



1. Introduction 

Molecular biology has undergone a dramatic revolution during the second 
half of the twentieth century, exemplified by the discovery of the structure of 
DNA and the sequencing of the human genome. Since then a series of tech- 
nological advances has given experimentalists the ability to make ever-more 
detailed measurements of an increasing number of molecular components of 
the cell. DNA microarrays, for instance, are small silicon chips spotted with 
short segments of DNA that can be used to measure the activity levels of 
thousands of different genes in tissue sample extracts simultaneously. Soon 
it might be possible to make large-scale quantitative measurements in a sin- 
gle cell. Being able to take such global snapshots of molecular processes has 
opened up the possibility of studying the changes that are constantly going 
on in cells as a coherent dynamical system with intricately interacting parts. 



rather than studying the 



jarts in isolation. Thus, the new field of systems 



biology has emerged [H: 14 1. 

Biological networks tend to be highly complex, with many variables that 
interact with each other in nonlinear ways, making it difficult to study such 
systems without the help of sophisticated mathematical tools and concepts. 
It is even unclear what the right formal language should be for their de- 



scription [171 ]. A characteristic feature of systems biology research is its 
heavy use of mathematical methods. One tool which has been applied re- 
cently to biological problems is computer algebra, a field of mathematics 
that combines the ability of computers to carry out symbolic calculations 
with concepts from abstract algebra. Computer algebra has been used in 
the life sciences in a variety of ways, such as the construction of phyloge- 
netic trees encoding the evolutionary relationship between different species 
0; 0], or the construction and analysis of models of intracellular biochemical 
networks l^; 2^] . For many more such applications see [3; [3] • 
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This paper aims to introduce mathematicians to the new field of algebraic 
biology by way of a very simple (simplified), well-studied biological network, 
which can be explained and studied in this limited space. We discuss two 
published models for this network, one continuous and one discrete, and an- 
alyze them using tools from computer algebra. While the models are simple 
enough to be analyzed by hand, the goal is to illustrate the applicability of 
the techniques. And the reader can easily verify the software calculations. 
Along the way we highlight mathematical challenges and research problems. 



2. Computer Algebra 

Computer algebra provides tools for computing with symbols rather than 
with floating point numbers. Software systems for computer algebra include 
familiar commercial packages, such as Maple, Mathematica, or Magma, as well 
as a wide range of more specialized systems, many of which are free and often 
run faster on specialized tasks. One important theme in computer algebra 
is the solution of non-linear algebraic equations. In the context of systems 
biology, this problem arises when one wishes to compute the steady states 
of a dynamic model. As an example, consider the following system of two 
equations where x and y are the unknowns and ki and k2 are parameters: 

-|- kixy — 1 = y"^ + k2xy — 1 = 0. 



Using a computer algebra method known as Grobner bases [111: Il9l: l23l|. the 
two given equations can easily be rewritten in the following equivalent form: 

{kik2-l)y^ + {k^-kik2+2)y^-l = k2X+{l-kik2)y^ +{kik2-kl-l)y = 0. 

The first equation involves no x. Using the quadratic formula, we can there- 
fore express y in terms of the parameters ki,k2- The second equation gives 
X in terms of y and ki,k2- Further analysis reveals that there are always 
four real solutions {x,y) if kik2 < 1 but only two real solutions if kik2 > 1. 

For a second example, consider the following equations in 9 unknowns 
which are derived from the discrete model for the lac operon in Section [5j 

Xl = X4X5 + X4, 

X2 = Xl 

xz = Xl 

Xi = 1 

X5 = XqX-j -I- X6 -I- 2:7 -I- 1 

= x-iXs 

Xl = Xq + Xg + Xg + XgXg + XQXg + XqXq + XQXgXg 
X8 = X2 
Xg = 1 

We consider this as a system of equations over the Boolean field with two 
elements {0, 1}. The arithmetic in this field is characterized by the equation 
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1 + 1 = 0. A Grobner basis for the given system consists of the expressions 

X5, Xi + 1, X2 + 1, X3 + 1, X4 + 1,Xq + 1, + 1, + l,Xg + 1. 

(Note that this computation is carried out in the polynomial ring over the 
field with two elements, so we need to account for the relations xf = Xi, when 
making this calculation.) From this Grobner basis we can read off immedi- 
ately that the only solution to this system is the 9-tuple (1,1, 1,1,0, 1,1, 1,1). 

This answer could have been found without Grobner bases, by either 
solving the system "by hand" or by plugging in all 2^ binary vectors of 
length 9. However, the discrete dynamical systems that are of interest in 
biology are now much more complex (due to advances in the experimental 
technologies, as argued above). For such systems, naive approaches will not 
work, and more sophisticated tools, such as computer algebra, are needed 
for the analysis. For instance, one of the biochemical networks used in the 



recent DREAM competition to reverse-engineer networks from data sets j22l | 
contains 58 molecular species, including mRNA, proteins, and metabolites. 
Data from this network can be used by reverse-engineering methods, e.g., 
[l^ to construct a Boolean network model. This model would have 2^^ 
states which precludes an analysis by exhaustive enumeration of the state 
space. 

3. The Lac Operon 

We illustrate the use of computer algebra in systems biology by way of a 
gene regulatory network discovered by Jacob and Monod [13] , earning them 
the 1965 Nobel Prize in Medicine. In prokaryotic organisms, some functions 
of gene regulation are accomplished by so-called operons, groups of genes 
that are adjacent to each other on the genome and are transcribed as a 
single segment of mRNA. Operons also have control elements - transcription 
factors - that bind to regulatory elements in the DNA and activate or inhibit 
the transcription of structural genes. Transcription factors that stimulate 
transcription are called inducers. They bind to regulatory elements in DNA 
called promoters. Repressors, on the other hand, bind to elements in DNA 
called operators and are effectively preventing transcription. 

The lac operon (Figure [I]) in E. coli is one such example. It enables E. 
coli to metabolize lactose into glucose and other products, in case glucose 
is not available directly. When cells grow in glucose-based media, the activ- 
ity of the enzymes involved in the metabolism of lactose is very low, even 
if lactose is available. However, when glucose is exhausted from the me- 
dia and lactose is present, the concentration of enzymes involved in lactose 
metabolism increases. This process is called induction [l^ . 

Figured] is to be interpreted as follows. In panel A no lactose is present. 
The repressor protein R can bind to the operator region of the lac operon 
and block RNA polymerase from transcribing the operon genes. In panel B 
lactose and its isomer allolactose are present. The allolactose causes a con- 
formational change in the repressor protein which prevents it from blocking 
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Figure 1. The lac operon. See [1, p. 37]. 

transcription. Consequently, RNA polymerase transcribes the operon into 
mRNA which, in turn, is translated into the proteins Z, Y, and A. 

How is the lac operon regulated? Positive control takes place in the ab- 
sence of glucose and presence of lactose in the cell. Some of the lactose is 
converted into allolactose, which acts as an inducer of transcription of the lac 
genes. This process involves the protein CAP (catabolite activator protein), 
which forms a complex with the substrate cAMP. This complex, in turn, 
binds to the activator region of the operon. Translation of the mRNA gives 
rise to the proteins /3-galactoside permease (LacY), which transports exter- 
nal lactose into the cell, and /3-galactosidase (LacZ), which cleaves lactose 
into glucose and its stereoisomer galactose. It also converts lactose into allo- 
lactose (Fig. [HB). a third protein, LacA, is not directly involved in lactose 
metabolism. Transcription continues until glucose becomes available. When 
that happens two negative control mechanisms take over. Inside the cell, 
synthesis of the protein cAMP is inhibited, which is needed for transcription 
of the operon, and the repressor protein Lad can bind to the operator re- 
gion of the lac operon, preventing transcription (Fig. [DA). This is known as 
catabolite repression. Furthermore, external glucose inhibits the transport 
of lactose into the cell through a process called inducer exlusion. We want 
to emphasize that this description is very simplified and does not mention 
many of the biochemical mechanisms involved in this control circuit. 
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4. A Continuous Model 

As one of the oldest known gene regulation network, the lac operon has 
been studied extensively and many different mathematical models have been 
constructed for it; see [13] • The most common type of model is based on 
ordinary differential equations. For a model from the recent literature see 
[25| . We discuss here the very simple dynamical systems model described in 
Section 5.2 of the undergraduate text book It consists of three equations, 
modeling the concentration of the repressor R and the rates of change of the 
operon mRNA M and allolactose A. The three equations are: 



^ = co + c{l-R)-jM, 

vMA 

— = ML-6A-- -. 

dt h + A 

Here cq, c, 7, v, S, h and L are certain model parameters, n is a fixed positive 
integer, and the concentrations R, M and A are functions of time t. 

This model is based on several assumptions. For instance, we do not 
distinguish between intracellular and extracellular lactose, and denote both 
by L. Another assumption is that /5-galactosidase is proportional to operon 
activity M and is not represented explicitly. The concentration of the re- 
pressor R is represented by a sigmoid function, a so-called Hill function. 
When extracellular lactose is present and is transported to the intracellular 
environment by lactose permease, produced by the activity of the operon, 
the allolactose concentration increases and inhibits the repressor R. The 
Hill coefficient n determines the shape of the sigmoidal function. The larger 
n, the steeper and more switch-like the curve becomes. As we will see, this 
coefficient also determines the degree of the polynomial system we need to 
solve. Biologically, n can often be interpreted as a specific measured quan- 
tity, such as the number of molecules of a particular species that are involved 
in a cooperative reaction. 

The rate of change of the gene transcripts Af is composed of a baseline 
activity represented by the constant cq, the concentration A of allolactose 
(which inhibits the repressor R), and a degradation term jM. The concen- 
tration of allolactose A increases with the activity M of the operon genes in 
conjunction with the presence of lactose L. Its degradation (the terms on 
the right-hand side with minus signs) is represented by a Michaelis-Menten 
type enzyme substrate reaction composed of two terms. The parameters 
Co, c, 7, V, 5, h and L need to be estimated using biological considerations or 
numerical methods, to ensure that the model is consistent with experimental 
data. 

This model is also quite simplified, both from a biological and a mathe- 
matical point of view. But even a simple model can be useful. The purpose 
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of modeling is to identify the essence of a system, that is, to identify the 
components and dynamics that are key to conferring the biological function. 
This is analogous to identifying that the key component for moving the bus 
forward is the engine. The art of constructing mathematical models of bio- 
logical systems (or any other type of system, for that matter) is to incorpo- 
rate the most important features and mechanisms and discard the irrelevant 
ones. Comparing the model to the Boolean network model constructed in 
the next section, we will see certain basic similarities, even though the math- 
ematics is different. We have a time-discrete finite dynamical system on the 
one hand and a continuous-time system given by differential equations on 
the other hand. 

It is now time to analyze the dynamics of the continuous model, by com- 
puting its steady states. We do this by phrasing the problem in a way that 
makes it amenable to using algebra, namely by setting the right hand sides 
of the differential equations to zero: 

1 uM A 
Co + c • (1 - — ) - 7- M = M-L - 6-A- f = 0. 

' ^ 1 + A"^' ' h + A 

This is a system of two algebraic equations in two variables A and M, which 
depends on the various parameters. Note that the steady state values for 
the missing variable R are determined by the equation R = 1/(1 + A^). 

Following the discussion in [1, Section 5.2], we leave the concentration L 
of lactose unspecified while the other parameter values are fixed as follows: 

c = 7 = z; = 1, Co = h = 2, m = b, 5 = ^. 

2U 5 

We also set n = 5. Our algebraic equations now take the form 

1 A^ 1 MA 
— + — -M = M-L A = 0. 

20 1 + ^5 5 2 + A 

By clearing denominators and eliminating the unknown M, we find that 

4^^ + (29-21L)y46 - 42LA^ + AA^ + - L) A - 2L = 0. 

This is a polynomial of degree 7 in A. The discriminant of this polynomial 
in j4 is a complicated polynomial of degree 12 in the parameter L. This 
discriminant has precisely two positive roots, which we determine to be 

Li = 0.68454... and La = 1.51054.... 

For all values of L between Li and L2, there are three positive steady states. 
For example, if L = 1 then the steady states (i?, M, A) of our system are 

(0.2272, 0.0506, 0.9994) , (0.6907, 0.1859, 0.8642) , (2.3717, 1.0368, 0.0132). 

The above expression AA^ + (29 — 21L)A^ + • • • is the equation of the bifur- 
cation diagram in the {A, L)-plane which is depicted in [5|, Figure 5.3(b)]. It 
describes the steady-state allolactose concentration A as a function of the 
lactose concentration L. As argued in [1, Section 5.3], the emergence of these 
three steady states shows that this model correctly captures key features of 
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the lac operon. Computer algebra allows us to vary other parameters and 
enables us to conduct a very careful analysis of the dynamics of this model. 
In particular, using computer algebra, we can derive a precise algebraic de- 
scription of the region in parameter space for which the dynamical system 
has more than one stationary point, and we can identify parameter values 
at which interesting phenomena (e.g. Hopf bifurcations [13|]) might occur. 



5. A Discrete Model 

While most models of the lac operon have used the framework of ordinary 
differential equations, other modeling frameworks can also be useful. We 
describe a discrete model, in the form of a Boolean network, taken from the 



recent article j2l| ]. Like most models, it is very simplified in its representation 
of biological details and mechanisms. But it attempts only to capture a basic 
dynamic feature of the lac operon, namely its bistability. 

Each one of the variables in the model can take on the states and 1, 
corresponding to the simplifying biological assumption that the role a molec- 
ular species plays in this network depends only on its absence or presence 
(defined by a suitably chosen threshold) rather than a more refined measure 
of its concentration in the cell. The model has nine variables, represent- 
ing concentrations of various molecular species, some of which have already 
appeared in the continuous model in the previous section: 

(1) mRNA for the genes LacZ, LacY, and Lac A (M) (the value of this 
variable indicates whether the operon is ON or OFF), 

(2) lac permease (P), 

(3) /?-galactosidase (B), 

(4) catabolite activator protein CAP (C), 

(5) repressor protein Lad (R), 

(6) lactose (L) and allolactose {A), 

(7) low concentrations of lactose (L^) and allolactose (Ag). 

The two variables Li and Ag are "artifical," in the sense that, for the 
model to be accurate, it needs to allow for three, rather than two, possible 
concentration levels of lactose and allolactose: absent, low, and high. In or- 
der to avoid the use of multistate variables, we can keep the binary setting 
by introducing additional variables that account for low concentrations of 
these chemicals. The model has two parameters: external lactose a and ex- 
ternal glucose g. These are either present (1) or absent (0). The interactions 
between these molecular species are described by Boolean functions: 



(1) 


Huit + l) = 




(2) 


Hp{t + 1) = 


M{t), 


(3) 


Hsit + l) = 


M{t), 


(4) 


Hc{t + l) = 




(5) 


Hnit + 1) = 


-^A{t) A --Aeit), 


(6) 


HA{t + l) = 


L{t)AB{t), 


(7) 


HAAt + ^) = 


A{t)y L{t) y Li{t) 
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(8) HL{t + l) = ^gAP{t)Aa, 

(9) HL,it + l) = -5A(L(t) Va). 

These functions are to be interpreted as follows. The genes are transcribed 
{H]\i{t + 1) = 1) at time t + 1 if the repressor R is absent and CAP is present 
at time t; permease and /3-galactosidase are present at time t + 1 if the 
operon is transcribed at time t. The interpretation of the other functions is 
similar. The result is a time-discrete dynamical system H(t), on the space 
of binary 9-tuples, with dynamics generated by iteration of H. The model 
assumes that the molecular mechanisms leading from activation of a gene to 
the production of the corresponding protein (transcription plus translation) 
happen in one time step, as does niRNA and protein degradation. 

It is shown in th;t this simple model, b^sed on very few assumptions, 
displays a dynamic behavior that captures an essential feature of the lac 
operon, namely, its bistability. To show this, one needs to examine the long 
term dynamics, or steady states and periodic states, of the model. A state 
of the system is represented by a binary 9-tuple (M, P, B, C, R, A, Ai, L, Lg), 
such as (1, 1, 1,0, 1, 0, 0, 1, 1). If we apply the dynamical system H to this 
state, with the parameter setting a = l,g = 0, that is, external lac- 
tose is present and external glucose is absent, we obtain the next state 
(0,1,1,1,1,1,1,0,1). After iterating five more times we reach the state 
u = (1,1,1,0,1,1,1,1), which turns out to be a steady state, that is, 
H(u) = u. This state corresponds to the operon being ON, with all molec- 
ular species, except the repressor R, present. We would like to compute all 
such steady states u for the system (l)-(9). 

In the case of this model it is possible to solve the resulting system of 
9 equations by hand, as mentioned earlier. Alternatively, one can find the 
steady states by computing the value of H on all 2^ 9-tuples for each of 
the four possible parameter settings, for a total of 2048 evaluations of H. 
(For instance, the software [iH] can perform this computation.) For larger 
models it is typically not possible to do either. For example, in [20] a Boolean 
network model of T-cell receptor signaling with 94 equations was published. 
In that case, we can use computational tools provided by computer algebra, 
which provide a systematic way to solve the system of steady state equations. 
The search for effective algorithms to solve very large systems of polynomial 
equations over finite fields, and the Boolean field in particular, is currently 
an active area of research (see, e.g., [3])- We demonstrate their use with the 
lac operon model discussed here. 

We first translate the Boolean functions in the model into polynomials. 
This uses arithmetic modulo 2. To translate a Boolean function into a poly- 
nomial function, we observe first that every Boolean function is expressed 
using the logical operators A,V, and These can be translated into poly- 
nomial operations by observing that the functions a A 6 and a ■ b take on the 
same Boolean values for given values of the variables. That is, both functions 
take on the value 1 precisely if both a and b take on the value 1, otherwise 
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the functions take on the value 0. Similarly, we see that aV b = a + b + a ■ b 
and -lO = a + 1. If we apply this dictionary to the Boolean functions in the 
model, while supressing the time variable, we obtain the following: 

Hm = {R + 1)C = RC + C, 

Hp = M = Hb, 

He = g + i, 

Hr = {A + l){Ai + l) = AAe + A + Ae + l, 

Ha = LB, 

Hai = A + L + + LLi + AL + ALi + ALLi, 

Hl = {g + l)aP, 

Hl, = {g + l){L + aL + a). 

A steady state of the system for a given choice of the parameters a and g is 
one for which the functions do not change the value of the variables. It is 
therefore a solution to the system of polynomial equations 

Hm = M, . . . , Hl^ = Le- 

The system in Section [2] arises for the parameter choice g = and a = 1. 
Solving the system for all four possible parameter settings, g = = a;g = 
l,a = 0; g = 1 = a; and g = 0,a = 1, results in the four corresponding 
steady states: 

(0,0,0,1,1,0,0,0,0), (0,0,0,0,1,0,0,0,0), 
(0,0,0,0,1,0,0,0,0), (1,1,1,1,0,1,1,1,1). 

The last steady state is the one we found in Section [2] and is the only one for 
which the lac operon is ON, in agreement with the biological system. Note 
that, for each parameter setting, any initialization of the system will end in 
the corresponding fixed point. 

In [2l[ another smaller model is presented that captures some key features 
of the feedback loop structure of this model and also shows the correct 



dynamics. The main biological conclusion drawn in [21[ from the models 
is that the dynamic behavior of the lac operon is due to properties of the 
network topology rather than the specific kinetics of the network that played 
a role in the continuous model discussed in the previous section. 



6. Discussion 

It is generally agreed that modern molecular biology can benefit greatly 
from the use of mathematical techniques that allow the construction of com- 
plex system-level models of biological networks. Conversely, the problems 
that arise in today's biological research can provide important stimuli for 
mathematical research. This is aptly expressed in the title Mathematics 
is Biology's Next Microscope, Only Better; Biology is Mathematics' Next 
Physics, Only Better of [1]. We have attempted here to describe through 
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mathematical models of the lac operon how algebra can contribute to a for- 
mal description and an analytical understanding of biological phenomena. 
One goal was to show that different types of mathematical models (discrete 
and continuous) can provide insight into biological mechanisms. Further- 
more, we have demonstrated that computer algebra, not traditionally used 
in biology, is a powerful tool that can help construct and analyze biologi- 
cal models in a systematic way. Thus, this paper should be viewed as an 
advertisement for an in-depth study of the relationship between computer 
algebra, and mathematics in general, and systems biology. The marriage 
between the two promises to be extremely fruitful for both. 

One forum for such interactions is the annual international conference 
series Algebraic Biology 0] which was started in 2005. Another one is the 
year-long program on Algebraic Methods in Systems Biology and Statistics 
held at the Statistical and Applied Mathematical Sciences Institute (SAMSI) 
in North Carolina during the academic year 2008-09. 

Several open mathematical research problems arise from our discussion. 
Firstly, in both Sections 3 and 4, we used computer algebra algorithms to 
determine solutions to systems of nonlinear polynomial equations. In one 
case the polynomials have real coefficients and in the other they take binary 
values. In the binary central problem is to improve the power of 

the algorithms and the speed of the software implementations. Unlike the 
real coefficient case, relatively little work has been done until recently on 
the development of efficient solution methods for polynomial systems over 
finite fields in general, and the Boolean field in particular. Where exact 
solutions cannot be computed, information about the number of solutions, 
e.g., the number of steady states, would be of interest. Tools involving 
zeta functions of algebraic varieties over finite fields might be of help here. 
Another source of problems in computer algebra comes from the task of 
inferring gene regulatory networks from experimental data sets, for instance, 
using the technique in |l6l |. 

There are numerous open problems concerning continuous dynamical sys- 
tems in systems biology. Two problems we find particularly important are 
the characterization of chemical reaction networks that allow bistability 
and the Global Attractor Conjecture for Toric Dynamical Systems (lol |. 
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